VOICE RECOGNITION SYSTEM

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to a voice recognition
system, and in particular to a speaker adaptive voice
recognition system which is robust to noise.

2. Description of the Related Art

In the related art, a system shown in Fig. 9 is well known 
as a speaker adaptive voice recognition system, for example. 

This voice recognition system is provided with a
previously prepared standard acoustic model 100 of an
unspecified speaker. A speaker adaptive acoustic model 200
is prepared by using the standard acoustic model 100 and a
feature vector of an input signal Sc generated from an input
voice uttered by a specified speaker, and the voice recognition
is conducted by adapting the system to the voice of the specified
speaker.

When the adaptive acoustic model 200 is prepared, the
standard vector Va corresponding to a designated text (sentence
or syllable) Tx is supplied from the standard acoustic model
100 to a path search section 4 and a speaker adaptation section
5. Further, the input signal Sc is inputted when the specified
speaker actually utters the designated text Tx.

Then, after an additive noise reduction section 1 removes
the additive noise included in the input signal Sc, a feature
vector generation section 2 generates a feature vector series
Vcf which represents the feature quantity of the input signal
Sc. Further, a multiplicative noise reduction section 3 removes
the multiplicative noise from the feature vector series Vcf, and
generates the feature vector series Vc from which the additive
noise and the multiplicative noise are removed. The feature
vector series Vc is supplied to the path search section 4 and
the speaker adaptation section 5.

In this manner, when the standard vector Va and the feature
vector series Vc of the actually uttered input signal Sc are
supplied to the path search section 4 and the speaker adaptation
section 5, the path search section 4 compares the feature vector
series Vc to the standard vector Va. Then, the appearance
probability of the feature vector series Vc for each syllable,
and the state transition probability from one syllable to another
syllable are found. Thereafter, the speaker adaptation
section 5 compensates the standard vector Va according to
the appearance probability and the state transition probability,
and thereby the speaker adaptive acoustic model 200 adapted to
the features of the voice (input signal) peculiar to the
specified speaker is prepared.

In this way, the speaker adaptive acoustic model 200 is adapted
to the input signal generated from the voice uttered by the
specified speaker. Thereafter, when the specified speaker
makes an arbitrary utterance, the feature vector of the input signal
generated from the uttered voice is collated with the adaptive
vector of the speaker adaptive acoustic model 200, and the voice
recognition is conducted in such a manner that the speaker
adaptive acoustic model 200 which gives the highest likelihood
is taken as the recognition result.

In the above conventional speaker adaptive
voice recognition system, when the adaptive acoustic model
200 is prepared, the additive noise reduction section 1 removes
the additive noise by the spectrum subtraction method, and the
multiplicative noise reduction section 3 removes the
multiplicative noise by the CMN (cepstrum mean
normalization) method, and thereby the speaker adaptive acoustic
model 200 not influenced by the noise is prepared.

That is, the additive noise reduction section 1 first finds
the spectrum of the input signal Sc, and then removes the
spectrum of the additive noise from it. The multiplicative
noise reduction section 3 first finds the time average value of
the cepstrum of the input signal Sc, and then subtracts the
time average value from the cepstrum of the input signal Sc.

However, with either the spectrum subtraction method
or the CMN method, it is very difficult to remove only the noise.
Because the feature information peculiar to the utterance of
the speaker, which should be compensated for by the
speaker adaptation, may also be lost, an adequate speaker adaptive
acoustic model 200 cannot be prepared. Therefore, there is a
problem that the voice recognition rate is degraded.

SUMMARY OF THE INVENTION 
An object of the present invention is to provide a speaker
adaptive voice recognition system which is robust to
noise and which attains an increased voice recognition rate.

In order to attain the above object, there is 
provided a voice recognition system comprising: 

a standard acoustic model having a standard vector 
generated according to information on voice; 

a first feature vector generation section for reducing 
noise from an input signal generated from an uttered voice 
corresponding to a designated text to generate a first feature 
vector; 

a second feature vector generation section for generating 
a second feature vector from the input signal having the noise; 
and 

a preparation section for generating an adaptive vector 
based on the first feature vector, the second feature vector 
and the standard vector, and preparing a speaker adaptive 
acoustic model suitable for the uttered voice. 

According to the present invention, the preparation
section compares the first feature vector with the standard
vector to obtain a path search result; and

the preparation section coordinates the second feature
vector with the standard vector according to the path search
result to generate the adaptive vector.

According to the present invention, the noise includes
additive noise and multiplicative noise.

According to the present invention, the first feature 
vector generation section includes an additive noise reduction 
section for reducing the additive noise from the input signal 
to generate an additive-noise reduced signal. 

According to the present invention, the additive noise
reduction section applies a transformation to the input signal
to generate a first spectrum and subtracts an additive noise
spectrum corresponding to the additive noise from the first
spectrum.

According to the present invention, the first feature
vector generation section includes a cepstrum calculator for
applying cepstrum calculation to the additive-noise reduced
signal.

According to the present invention, the first feature
vector generation section includes a multiplicative noise
reduction section for reducing the multiplicative noise by
subtracting the multiplicative noise from the first feature
vector.

According to the present invention, the first feature
vector contains a plurality of time-series first feature
vectors; and

the multiplicative noise reduction section calculates
a time average of the time-series first feature vectors for
estimating the multiplicative noise.

According to the present invention, the second feature
vector generation section applies at least cepstrum calculation
to the second spectrum to generate the second feature vector.

With such a structure, at the time of speaker
adaptation, the first feature vector generation section
generates the first feature vector, which excludes the additive noise
of the environment surrounding the speaker or the
multiplicative noise such as the transmission noise of the present
voice recognition system itself. The second feature vector
generation section generates the second feature vector,
which includes the features of the additive noise of the environment
surrounding the speaker or of the multiplicative
noise such as the transmission noise of the present voice recognition
system itself. Then, the preparation section generates the
adaptive vector by compensating the standard vector according
to the first feature vector not including the noise and the
second feature vector including the noise. The speaker adaptive
acoustic model is then updated with the adaptive vector, so that
it is adapted to the voice of the speaker.

As described above, the standard vector in the standard
acoustic model is compensated according to the feature vector not
including the noise and the feature vector including the noise.
Therefore, a speaker adaptive acoustic model
corresponding to the actual utterance environment can be
prepared, and a voice recognition system which is robust to
noise and has a higher voice recognition rate can be
realized.

Further, the second feature vector generation section
generates the feature vector without removing the additive noise
or multiplicative noise, and this feature vector is used for the
speaker adaptation. Therefore, the feature information of the
original voice is not removed, and an adequate speaker adaptive
acoustic model can be generated.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram showing the structure of a voice 
recognition system of an embodiment of the present invention. 

Fig. 2 is a table typically showing the structure of a 
standard acoustic model. 

Fig. 3 is a table showing feature vector series [s_{i,M}]
generated in a feature vector generation section 12 at the time
of speaker adaptation.

Fig. 4 is a table showing feature vector series [c_{i,M}]
outputted from a multiplicative noise reduction section 9 at
the time of speaker adaptation.

Fig. 5 is a table showing the corresponding relationship
of the feature vector series [c_{i,M}] with a standard vector
[a_{n,M}] according to the frame number and the state number.

Fig. 6 is a table showing the relationship of the feature
vector series [c_{i,M}], the standard vector [a_{n,M}], the frame number
and the state number.

Fig. 7 is a table showing the relationship of the average
feature vector generated by the speaker adaptation with the
standard vector.

Fig. 8 is a table showing the content of the speaker
adaptive acoustic model after update.

Fig. 9 is a block diagram showing the structure of a speaker
adaptation type voice recognition system in the related art.

DETAILED DESCRIPTION OF THE PRESENT INVENTION 
The present invention will be described below with reference
to the accompanying drawings.
Fig. 1 is a block diagram showing the
structure of a voice recognition system according to an
embodiment of the present invention.

In Fig. 1, the voice recognition system comprises the
standard acoustic model (hereinafter referred to as the standard
voice HMM) 300 of an unspecified speaker, previously prepared
by using the Hidden Markov model (HMM), and the speaker adaptive
acoustic model (hereinafter referred to as the adaptive voice
HMM) 400 prepared by the speaker adaptation.

To make the embodiment of the present invention easy to
understand, the number of states of the standard voice
HMM 300 for each syllable is assumed to be 1. Further, the standard voice HMM 300
has an appearance probability distribution for each syllable,
and the average vector of the appearance probability distribution
is used as the standard vector.

Accordingly, as typically shown in Fig. 2, the standard
voice HMM 300 has an M dimensional standard vector [a_{n,M}] for
each syllable. That is, when the standard voice HMM 300 is
prepared, for example, voice data generated from voices uttered
by one or a plurality of speakers (unspecified speakers)
in a quiet environment is divided into frames of a predetermined time.
The framed voice data is successively subjected to the cepstrum
operation to generate the feature vector series in the cepstrum
domain for a plurality of frames for each syllable. Averaging
the feature vector series over the plurality of frames yields
the standard voice HMM 300 composed of the standard vector
[a_{n,M}] for each syllable.

Herein, the variable n of the standard vector [a_{n,M}]
expresses the state number identifying each syllable, and the
variable M expresses the dimension of the vector. For example,
the Japanese syllable [A] corresponding to the state number
n = 1 is characterized as the M dimensional standard vector
[a_{n,M}] = [a_{1,1}, a_{1,2}, a_{1,3}, ..., a_{1,M}], and the Japanese syllable
[I] corresponding to the state number n = 2 is characterized
as the M dimensional standard vector [a_{n,M}] = [a_{2,1}, a_{2,2},
a_{2,3}, ..., a_{2,M}]. The same rule applies to the
following, and the remaining syllables are also characterized
as M dimensional standard vectors [a_{n,M}] distinguished by
the state number n.

At the time of the speaker adaptation which will be
described later, the standard voice HMM 300 is supplied with
the designated text Tx of a previously determined sentence
or syllable, and the standard vectors [a_{n,M}] corresponding to
the syllables constituting the designated text Tx are supplied
to the path search section 10 and the speaker adaptation section
11 in the order in which the syllables are arranged.

For example, when the designated text Tx of Japanese
[KONNICHIWA] is supplied, the standard vectors corresponding
to the respective state numbers n = 10, 46, 22, 17, 44 expressing
[KO], [N], [NI], [CHI], [WA], namely
[a_{10,1}, a_{10,2}, a_{10,3}, ..., a_{10,M}],
[a_{46,1}, a_{46,2}, a_{46,3}, ..., a_{46,M}],
[a_{22,1}, a_{22,2}, a_{22,3}, ..., a_{22,M}],
[a_{17,1}, a_{17,2}, a_{17,3}, ..., a_{17,M}], and
[a_{44,1}, a_{44,2}, a_{44,3}, ..., a_{44,M}],
are supplied to the path search section 10 and the speaker
adaptation section 11 in order.

Further, the voice recognition system of the present
invention is provided with a framing section 6, an additive noise
reduction section 7, a feature vector generation section 8,
a multiplicative noise reduction section 9, and a feature
vector generation section 12.

When the specified speaker actually
utters the designated text Tx at the time of the speaker
adaptation, the framing section 6 divides the input signal Sc
generated from the uttered voice into frames of a predetermined
time (for example, 10 - 20 msec) and outputs the frames to the
additive noise reduction sections 7, 13 and the feature vector
generation section 12.
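
The framing step can be sketched as follows. This is only a minimal
illustration, assuming that the input signal is held as a list of
samples and that the frames do not overlap; the specification states
neither the sampling rate nor whether frames overlap, and the name
frame_signal is hypothetical.

```python
def frame_signal(samples, sample_rate_hz, frame_ms=10.0):
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000.0)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    # Each slice is one framed input signal (Scf in the text).
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```

For instance, 800 samples at an assumed 8000 Hz sampling rate with
10 msec frames give 10 frames of 80 samples each.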

The additive noise reduction section 7 successively
conducts Fourier transformation on the framed input signal
Scf of each frame to generate a spectrum for each
frame. Further, the additive noise included in each spectrum
is removed in the spectrum domain, and the resulting spectrum is outputted.
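
The removal of the additive noise in the spectrum domain might look
like the sketch below. It assumes that a power spectrum of one frame
and an estimate of the additive noise power spectrum are already
available as lists (the Fourier transformation itself is omitted),
and that negative results are clipped to a floor value. The
specification does not say how the noise spectrum is estimated or how
negative values are handled, and the name spectral_subtract is
hypothetical.

```python
def spectral_subtract(frame_spectrum, noise_spectrum, floor=0.0):
    """Subtract an estimated additive-noise power spectrum from one frame's power spectrum.

    Both arguments are lists of non-negative power values, one per frequency bin.
    Values that would become negative are clipped to `floor` (an assumption;
    the specification does not say how negative results are handled).
    """
    if len(frame_spectrum) != len(noise_spectrum):
        raise ValueError("spectra must have the same number of bins")
    return [max(s - n, floor) for s, n in zip(frame_spectrum, noise_spectrum)]
```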

The feature vector generation section 8 conducts the
cepstrum operation on the spectrum having no additive noise
for each frame to generate the feature vector series [c_{i,M}]'
in the cepstrum domain. Here, the variable i
of the feature vector series [c_{i,M}]' expresses the frame order (frame number),
and the variable M expresses the dimension.

The multiplicative noise reduction section 9 removes the
multiplicative noise from the feature vector series [c_{i,M}]'
by using the CMN method. That is, the plurality of feature
vectors [c_{i,M}]' obtained for each frame i by the feature
vector generation section 8 are time-averaged for each dimension.
The M dimensional time average value [ĉ_M] obtained thereby
is subtracted from each feature vector [c_{i,M}]' to generate the
feature vector series [c_{i,M}] from which the multiplicative
noise is removed. The feature vector series [c_{i,M}] thus generated
is supplied to the path search section 10.
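
The CMN step described above, a time average for each dimension
followed by a subtraction from every frame, can be sketched as
follows. The sketch assumes the cepstral feature vectors are given as
a list of equal-length lists; the name cepstral_mean_normalize is
hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Return (normalized_frames, mean) for a list of M-dimensional cepstral vectors."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    n = len(frames)
    # Time average for each dimension (the estimate of the multiplicative noise).
    mean = [sum(f[j] for f in frames) / n for j in range(dim)]
    # Subtract the time average from every frame.
    normalized = [[f[j] - mean[j] for j in range(dim)] for f in frames]
    return normalized, mean
```

Multiplicative distortion in the spectrum domain becomes an additive
offset in the cepstrum domain, which is why subtracting the time
average reduces it.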

The feature vector generation section 12 successively
Fourier-transforms the framed input signal Scf of each
frame outputted from the framing section 6 to generate a
spectrum for each frame. Further, each spectrum is
subjected to the cepstrum operation for each frame, and the
feature vector series [s_{i,M}] in the cepstrum domain is thereby
generated and supplied to the
speaker adaptation section 11. Here, the variable
i of the feature vector series [s_{i,M}] expresses the frame order,
and the variable M expresses the dimension.

In this manner, the designated text Tx, the standard vector
[a_{n,M}] and the feature vector series [c_{i,M}] are supplied to the path
search section 10. The designated text Tx, the standard vector
[a_{n,M}] and the feature vector series [s_{i,M}] are supplied to the speaker
adaptation section 11.

The path search section 10 compares the standard vector
[a_{n,M}] to the feature vector series [c_{i,M}], and judges which
syllable of the designated text Tx corresponds to the feature
vector series [c_{i,M}] for each frame. The path search result
Dv is supplied to the speaker adaptation section 11.

The speaker adaptation section 11 divides the feature
vector series [s_{i,M}] from the feature vector generation section
12 by syllable according to the path search result Dv.
Then, the average is obtained for each dimension of
the feature vector series [s_{i,M}] of each divided syllable.
As a result, the average feature vector [ŝ_{n,M}] for each syllable
is generated.

Further, the speaker adaptation section 11 finds a
difference vector [d_{n,M}] between the standard vector [a_{n,M}] of
each syllable corresponding to the designated text Tx and
the average feature vector [ŝ_{n,M}]. Then, averaging
these difference vectors [d_{n,M}] yields
the M dimensional movement vector [m_M] expressing the
feature of the specified speaker. Further, the adaptive
vectors [x_{n,M}] for all syllables are generated by adding the
movement vector [m_M] to the standard vectors [a_{n,M}] of all
syllables from the standard voice HMM 300. The adaptive voice
HMM 400 is updated by these adaptive vectors [x_{n,M}].

Next, referring to Fig. 2 - Fig. 8, the function of the 
path search section 10 and the speaker adaptation section 11 
will be described in detail. 

Here, the designated text Tx of Japanese
[KONNICHIWA] is used as a typical example.

Further, it is assumed that the input signal Sc of Japanese
[KONNICHIWA] uttered by the speaker is divided into 30 frames
by the framing section 6 and inputted.

As shown in Fig. 2, the standard voice HMM 300 is prepared
as the standard vectors [a_{n,M}] of the unspecified speaker
corresponding to each of a plurality of syllables. Further,
each syllable is classified by the state number n.

Further, before the speaker adaptation, the adaptive voice HMM 400
is set to the same content (default setting) as the standard
vectors [a_{n,M}] of the standard voice HMM 300, as shown
in Fig. 2.

At the beginning of the speaker adaptation processing,
the designated text Tx of Japanese [KONNICHIWA] is supplied
to the standard voice HMM 300. Then, supplied to the path search
section 10 and the speaker adaptation section 11 are the standard
vector [a_{10,1}, a_{10,2}, a_{10,3}, ..., a_{10,M}] corresponding to the
state number n = 10 expressing the syllable [KO], the standard
vector [a_{46,1}, a_{46,2}, a_{46,3}, ..., a_{46,M}] corresponding to the
state number n = 46 expressing the syllable [N], the standard
vector [a_{22,1}, a_{22,2}, a_{22,3}, ..., a_{22,M}] corresponding to the
state number n = 22 expressing the syllable [NI], the standard
vector [a_{17,1}, a_{17,2}, a_{17,3}, ..., a_{17,M}] corresponding to the
state number n = 17 expressing the syllable [CHI], and the
standard vector [a_{44,1}, a_{44,2}, a_{44,3}, ..., a_{44,M}] corresponding
to the state number n = 44 expressing the syllable [WA].

Next, when the speaker utters [KONNICHIWA], the framing
section 6 divides the input signal Sc into 30 frames according
to the lapse of time, and outputs the divided input signal Scf.
Then, the feature vector generation section 12 generates the
feature vectors [s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{1,M}] to [s_{30,1}, s_{30,2},
s_{30,3}, ..., s_{30,M}] of the framed input signal Scf according to the
order of the frames, and supplies them to the speaker adaptation
section 11.

That is, as typically shown in Fig. 3, the feature vector
generation section 12 generates the feature vector series for
30 frames of i = 1 - 30, [s_{i,M}] = [s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{1,M}]
to [s_{30,1}, s_{30,2}, s_{30,3}, ..., s_{30,M}], and supplies it to the speaker
adaptation section 11.

Meanwhile, in the processing system composed of the
additive noise reduction section 7, feature vector generation
section 8, and multiplicative noise reduction section 9,
the feature vector series [c_{i,M}] = [c_{1,1}, c_{1,2}, c_{1,3}, ..., c_{1,M}]
to [c_{30,1}, c_{30,2}, c_{30,3}, ..., c_{30,M}] for
30 frames of i = 1 - 30 are generated according to the framed
input signal Scf of each frame supplied from the framing section
6, and supplied to the path search section 10. That is, as
typically shown in Fig. 4, the feature vector series for 30
frames [c_{i,M}] = [c_{1,1}, c_{1,2}, c_{1,3}, ..., c_{1,M}] to [c_{30,1}, c_{30,2},
c_{30,3}, ..., c_{30,M}] are supplied to the path search section 10 through
the multiplicative noise reduction section 9.

The path search section 10 compares the feature vector
series [c_{i,M}] for 30 frames to the standard vectors [a_{n,M}]
corresponding to each syllable of the designated text Tx, by
a method such as the Viterbi algorithm or the forward-backward algorithm,
and finds which syllable corresponds to the feature vector series
[c_{i,M}] at each moment for each frame.

Thereby, as shown in Fig. 5, each frame number i of the 30
frames is coordinated to the state number n expressing each
syllable of [KONNICHIWA]. Then, the coordinated result is
supplied to the speaker adaptation section 11 as the path search
result Dv.
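
A path search of this kind can be illustrated with a simplified
Viterbi-style alignment. The sketch below is not the method of the
specification itself: it assumes one state per syllable visited
strictly in text order, uses the squared distance to each standard
vector in place of the appearance probability, and ignores state
transition probabilities. The name align_frames is hypothetical.

```python
def align_frames(frames, state_vectors):
    """Assign each frame to one state of a left-to-right sequence (forced alignment).

    frames: list of feature vectors (the noise-reduced series).
    state_vectors: standard vectors in the order of the designated text.
    Returns a list giving, for each frame, the index into state_vectors.
    """
    T, S = len(frames), len(state_vectors)
    if T < S:
        raise ValueError("need at least one frame per state")

    def dist(x, a):
        return sum((xi - ai) ** 2 for xi, ai in zip(x, a))

    INF = float("inf")
    cost = [[INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    cost[0][0] = dist(frames[0], state_vectors[0])
    for t in range(1, T):
        for s in range(S):
            stay = cost[t - 1][s]
            move = cost[t - 1][s - 1] if s > 0 else INF
            if move < stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            if best < INF:
                cost[t][s] = best + dist(frames[t], state_vectors[s])
    # Trace back from the final state at the final frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path
```

The returned list plays the role of the path search result Dv: it
maps every frame to a position in the designated text, from which the
state number of the syllable follows.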

The speaker adaptation section 11 coordinates the feature
vectors [s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{1,M}] to [s_{30,1}, s_{30,2}, s_{30,3}, ...,
s_{30,M}] to the standard vectors [a_{10,1}, a_{10,2}, a_{10,3}, ..., a_{10,M}],
[a_{46,1}, a_{46,2}, a_{46,3}, ..., a_{46,M}], [a_{22,1}, a_{22,2}, a_{22,3}, ...,
a_{22,M}], [a_{17,1}, a_{17,2}, a_{17,3}, ..., a_{17,M}], and [a_{44,1}, a_{44,2},
a_{44,3}, ..., a_{44,M}], according to the path search result Dv.

That is, as shown in Fig. 6, the standard vector [a_{10,1},
a_{10,2}, a_{10,3}, ..., a_{10,M}] is coordinated to the feature vectors
[s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{1,M}] to [s_{6,1}, s_{6,2}, s_{6,3}, ..., s_{6,M}] of
the frame numbers i = 1 - 6 corresponding to the syllable [KO]
obtained by the path search. The standard vector [a_{46,1}, a_{46,2},
a_{46,3}, ..., a_{46,M}] is coordinated to the feature vectors [s_{7,1},
s_{7,2}, s_{7,3}, ..., s_{7,M}] to [s_{10,1}, s_{10,2}, s_{10,3}, ..., s_{10,M}] of the
frame numbers i = 7 - 10 corresponding to the syllable [N].

Further, the standard vector [a_{22,1}, a_{22,2}, a_{22,3}, ...,
a_{22,M}] is coordinated to the feature vectors [s_{11,1}, s_{11,2}, s_{11,3},
..., s_{11,M}] to [s_{14,1}, s_{14,2}, s_{14,3}, ..., s_{14,M}] of the frame numbers
i = 11 - 14 corresponding to the syllable [NI]. The standard
vector [a_{17,1}, a_{17,2}, a_{17,3}, ..., a_{17,M}] is coordinated to the
feature vectors [s_{15,1}, s_{15,2}, s_{15,3}, ..., s_{15,M}] to [s_{18,1}, s_{18,2},
s_{18,3}, ..., s_{18,M}] of the frame numbers i = 15 - 18 corresponding
to the syllable [CHI]. The standard vector [a_{44,1}, a_{44,2}, a_{44,3},
..., a_{44,M}] is coordinated to the feature vectors [s_{19,1}, s_{19,2},
s_{19,3}, ..., s_{19,M}] to [s_{30,1}, s_{30,2}, s_{30,3}, ..., s_{30,M}] of the frame
numbers i = 19 - 30 corresponding to the syllable [WA].

Next, the speaker adaptation section 11 divides the
feature vectors [s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{1,M}] to [s_{30,1}, s_{30,2}, s_{30,3},
..., s_{30,M}] for 30 frames shown in Fig. 6 by syllable into
[KO], [N], [NI], [CHI], [WA]. As shown in Fig. 7, the average
feature vector [ŝ_{n,M}] for each of the syllables [KO], [N], [NI], [CHI],
[WA] is generated by averaging the feature vectors of each
divided group.

That is, for the feature vectors [s_{1,1}, s_{1,2},
s_{1,3}, ..., s_{1,M}] to [s_{6,1}, s_{6,2}, s_{6,3}, ..., s_{6,M}] of the first
to sixth frames (number of frames k = 6) corresponding to the syllable
[KO] shown in Fig. 6, as shown by the following expression (1),
the six first-dimension elements s_{1,1} to s_{6,1} are added, and
the first-dimension element ŝ_{n,1} of the average feature
vector [ŝ_{n,M}] is obtained by dividing the added value
(s_{1,1} + s_{2,1} + s_{3,1} + s_{4,1} + s_{5,1} + s_{6,1}) by the number of frames k =
6. Further, in the same manner, for the six second-dimension
elements s_{1,2} to s_{6,2}, the added value (s_{1,2} + s_{2,2} + s_{3,2} +
s_{4,2} + s_{5,2} + s_{6,2}) is obtained. Then, the second-dimension
element ŝ_{n,2} of the average feature vector [ŝ_{n,M}] is obtained
by dividing it by the number of frames k = 6. Proceeding in the
same manner up to the six M-th dimension elements s_{1,M} to s_{6,M},
the element ŝ_{n,M} is obtained, and the M dimensional
average feature vector [ŝ_{n,1}, ŝ_{n,2}, ŝ_{n,3}, ..., ŝ_{n,M}]
corresponding to the syllable [KO], composed of these M
elements ŝ_{n,1} to ŝ_{n,M}, is generated.



ŝ_{n,1} = (s_{1,1} + s_{2,1} + s_{3,1} + s_{4,1} + s_{5,1} + s_{6,1}) / k ... (1)

where the variable k in the expression (1) is the number of
frames in each syllable;

the variable n is the state number to distinguish each 
syllable; and 

the variable M expresses the dimension. 

Accordingly, the variable n in the expression (1) is
n = 10, and the M dimensional average feature vector
corresponding to the syllable [KO] is [ŝ_{10,1}, ŝ_{10,2},
ŝ_{10,3}, ..., ŝ_{10,M}].

Further, the average feature vector [ŝ_{46,1}, ..., ŝ_{46,M}]
corresponding to the remaining syllable [N], the average
feature vector [ŝ_{22,1}, ..., ŝ_{22,M}] corresponding to the syllable
[NI], the average feature vector [ŝ_{17,1}, ..., ŝ_{17,M}]
corresponding to the syllable [CHI], and the average feature
vector [ŝ_{44,1}, ..., ŝ_{44,M}] corresponding to the syllable
[WA] are also obtained in the same manner.

Next, according to the next expression (2), the difference
vectors [d_{10,1}, ..., d_{10,M}], [d_{46,1}, ..., d_{46,M}], [d_{22,1}, ..., d_{22,M}],
[d_{17,1}, ..., d_{17,M}], and [d_{44,1}, ..., d_{44,M}] between the average feature
vectors [ŝ_{10,1}, ..., ŝ_{10,M}], [ŝ_{46,1}, ..., ŝ_{46,M}], [ŝ_{22,1}, ...,
ŝ_{22,M}], [ŝ_{17,1}, ..., ŝ_{17,M}], [ŝ_{44,1}, ..., ŝ_{44,M}] corresponding
to each of the syllables [KO], [N], [NI], [CHI], and [WA], and the
standard vectors [a_{10,1}, ..., a_{10,M}], [a_{46,1}, ..., a_{46,M}], [a_{22,1}, ...,
a_{22,M}], [a_{17,1}, ..., a_{17,M}], and [a_{44,1}, ..., a_{44,M}] are respectively
obtained.

d_{n,j} = ŝ_{n,j} - a_{n,j} ... (2)

where the variable n in the expression (2) shows the state
numbers n = 10, 46, 22, 17, 44 corresponding to each of the syllables
[KO], [N], [NI], [CHI], [WA]; and

the variable j shows each of the dimensions j = 1 - M of the
vector.

Then, the obtained difference vectors [d_{10,1}, ..., d_{10,M}],
[d_{46,1}, ..., d_{46,M}], [d_{22,1}, ..., d_{22,M}], [d_{17,1}, ..., d_{17,M}], and [d_{44,1},
..., d_{44,M}] are applied to the next expression (3). The M
dimensional movement vector [m_M] = [m_1, m_2, ..., m_M] of these
5 (V = 5) syllables of [KO], [N], [NI], [CHI], [WA] is obtained
from the average for each dimension.

m_j = (1/V) Σ_n d_{n,j} ... (3)

where the variable j in the expression (3) shows each
dimension j = 1 - M of the vector;

the variable n shows the state numbers n = 10, 46, 22,
17, 44 corresponding to each of the syllables [KO], [N], [NI], [CHI],
[WA]; and

the variable V shows the number of syllables (V = 5).
The movement vector [m_M] = [m_1, m_2, ..., m_M] thus obtained
expresses the feature of the specified speaker. Then, as shown
by the next operational expression (4), the adaptive vector
[x_{n,M}] having the feature proper to the speaker is obtained by
adding the movement vector [m_M] to the standard vector
[a_{n,M}] of all the syllables. Further, as shown in Fig. 8, the
processing of the speaker adaptation is completed by updating
the adaptive voice HMM 400 with the obtained adaptive vector
[x_{n,M}].

[x_{n,M}] = [a_{n,M}] + [m_M] ... (4)
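
The adaptation steps described above (the average for each syllable,
the difference from the standard vector, the movement vector averaged
over the uttered syllables, and the addition to every standard
vector) can be sketched in one function. The data layout is an
assumption made for illustration: the feature vectors are lists, the
path search result is a list of state numbers, one for each frame,
and the standard vectors are held in a dict keyed by state number.
The name adapt_standard_vectors is hypothetical.

```python
def adapt_standard_vectors(noisy_frames, frame_states, standard):
    """Compute the adaptive vectors and the movement vector.

    noisy_frames: feature vectors s_i generated without noise reduction.
    frame_states: state number n assigned to each frame by the path search.
    standard: dict mapping every state number n to its standard vector a_n.
    Returns (dict of adaptive vectors x_n for all states, movement vector m).
    """
    dim = len(noisy_frames[0])
    # Average feature vector for each uttered syllable.
    sums, counts = {}, {}
    for vec, n in zip(noisy_frames, frame_states):
        acc = sums.setdefault(n, [0.0] * dim)
        for j in range(dim):
            acc[j] += vec[j]
        counts[n] = counts.get(n, 0) + 1
    averages = {n: [v / counts[n] for v in sums[n]] for n in sums}
    # Difference between each average and the matching standard vector.
    diffs = {n: [averages[n][j] - standard[n][j] for j in range(dim)]
             for n in averages}
    # Movement vector: average of the differences over the V uttered syllables.
    num_syllables = len(diffs)
    movement = [sum(d[j] for d in diffs.values()) / num_syllables
                for j in range(dim)]
    # Adaptive vectors: every standard vector shifted by the movement vector.
    adapted = {n: [a[j] + movement[j] for j in range(dim)]
               for n, a in standard.items()}
    return adapted, movement
```

Because the movement vector is added to all standard vectors,
syllables that were not uttered in the designated text are shifted as
well.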

It is described hereinabove that the adaptive voice HMM
400 is speaker-adapted according to the designated
text Tx of [KONNICHIWA]. However, when the adaptive voice HMM
400 is speaker-adapted according to a designated
text Tx including other syllables, all the syllables in the
adaptive voice HMM 400 can also be speaker-adapted.

Next, after the adaptive voice HMM 400 has been generated by
the speaker adaptation, when the specified speaker makes an arbitrary
utterance, the framing section 6 divides the input signal Sc
into frames of a predetermined time (for example, 10
- 20 msec) in the same manner as above. Then, the framing
section 6 outputs the framed input signal Scf of each frame
according to the lapse of time, and supplies it to the additive
noise reduction section 13.

The additive noise reduction section 13, in the same manner
as the above additive noise reduction section 7, conducts Fourier
transformation on the framed input signal Scf of each frame,
and generates the spectrum for each frame. Further, the additive
noise reduction section 13 removes the additive noise included
in each spectrum in the spectrum domain, and outputs the spectrum
to the feature vector generation section 14.

The feature vector generation section 14, in the same
manner as the above feature vector generation section 8,
conducts the cepstrum operation on the spectrum having no
additive noise for each frame, generates the feature vector series
[y_{i,M}]' in the cepstrum domain, and outputs it to the multiplicative
noise reduction section 15.

The multiplicative noise reduction section 15, in the
same manner as the above multiplicative noise reduction
section 9, removes the multiplicative noise from the feature
vector series [y_{i,M}]' by using the CMN method, and supplies the
M dimensional feature vector series [y_{i,M}] having no
multiplicative noise to the recognition section 16. Here,
the variable i of the feature vector series [y_{i,M}] expresses
the frame number.

As described above, when the feature vector series [y_{i,M}]
based on the input signal generated from the actually uttered
voice is supplied to the recognition section 16, the recognition
section 16 collates the feature vector series [y_{i,M}] with the
adaptive vector [x_{n,M}] of the adaptive voice HMM 400 in which
the speaker adaptation is conducted, and outputs the adaptive
voice HMM 400 which gives the highest likelihood as the
recognition result.
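
As a rough illustration of collating feature vectors with the
adaptive vectors, the sketch below scores one syllable-length
utterance against each adaptive vector and returns the best state
number. The specification does not give the likelihood computation;
the sketch assumes a unit-variance Gaussian centred on each adaptive
vector, so that a higher likelihood simply means a smaller summed
squared distance. The name recognize_syllable is hypothetical.

```python
def recognize_syllable(frames, adaptive_vectors):
    """Return the state number whose adaptive vector best matches all frames.

    Assumption: each state is a unit-variance Gaussian centred on its adaptive
    vector, so the log-likelihood is, up to a constant, minus half the summed
    squared distance.
    """
    def log_likelihood(mean):
        return -0.5 * sum(
            (x - m) ** 2 for frame in frames for x, m in zip(frame, mean)
        )

    # Pick the state number with the highest log-likelihood.
    return max(adaptive_vectors, key=lambda n: log_likelihood(adaptive_vectors[n]))
```

Recognition of a whole utterance would additionally require a search
over syllable sequences, which is outside the scope of this sketch.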

As described above, according to the voice recognition 
system of the present invention, when the specified speaker 
utters the designation text Tx upon speaker adaptation, the 
additive noise reduction section 7, feature vector generation 
section 8 and multiplicative noise reduction section 9 generate 
the feature vector series [ci, M] from which the additive noise 
and multiplicative noise are removed. The feature vector 
generation section 12 generates the feature vector series [si, 
M] according to the framed input signal Scf including the additive 
noise and multiplicative noise. The path search section 10 and 
speaker adaptation section 11 generate the adaptive vector [xn, 
M] according to these feature vector series [ci, M], feature vector 
series [si, M], and standard vector [an, M]. The adaptive vector 
[xn, M] in which the speaker adaptation is conducted updates 
the adaptive voice HMM 400. 
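
The flow summarized above can be sketched as follows, under loudly stated assumptions. The path search is assumed to be a left-to-right alignment of the noise-removed series [ci, M] against the standard vectors [an, M], and the update is assumed to be the simplest one, in which each standard vector is moved by the difference between itself and the mean of the noise-including frames [si, M] assigned to its state. The names sq_dist, path_search and adapt are hypothetical, and the actual update rule of a given embodiment may differ.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def path_search(clean_frames, standard_vectors):
    """Align frames to states left to right; return one state index per frame."""
    n_frames, n_states = len(clean_frames), len(standard_vectors)
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")
    inf = float("inf")
    cost = [[inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    cost[0][0] = sq_dist(clean_frames[0], standard_vectors[0])
    for i in range(1, n_frames):
        for s in range(n_states):
            stay = cost[i - 1][s]
            move = cost[i - 1][s - 1] if s > 0 else inf
            prev_cost = min(stay, move)
            if prev_cost < inf:
                cost[i][s] = prev_cost + sq_dist(clean_frames[i], standard_vectors[s])
                back[i][s] = s if stay <= move else s - 1
    # Backtrack from the last state at the last frame.
    path = [n_states - 1]
    for i in range(n_frames - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


def adapt(clean_frames, noisy_frames, standard_vectors):
    """Return adaptive vectors: standard vectors shifted toward the noisy-frame means."""
    path = path_search(clean_frames, standard_vectors)
    adapted = []
    for n, a_n in enumerate(standard_vectors):
        assigned = [s for s, state in zip(noisy_frames, path) if state == n]
        if not assigned:
            adapted.append(list(a_n))  # no data for this state: keep the standard vector
            continue
        mean = [sum(col) / len(assigned) for col in zip(*assigned)]
        # Difference vector (mean - a) compensates the standard vector.
        adapted.append([a + (m - a) for a, m in zip(a_n, mean)])
    return adapted
```

The point of the sketch is the division of roles: the alignment uses the series without noise, while the values that update the model come from the series that still includes the noise.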

Accordingly, the feature vector series [si, M], which includes 
the feature of the noise (additive noise) of the peripheral 
circumstance surrounding the specified speaker, or of the 
transmission noise (multiplicative noise) of the present voice 
recognition system itself, is used for the speaker adaptation. 
Therefore, the adaptive voice HMM 400 corresponding to the actual 
utterance circumstance can be generated, and a voice 
recognition system which is robust to the noise and whose voice 
recognition rate is high can be obtained. 

Further, in the speaker adaptation type voice recognition 
system in the related art, at the time of the speaker adaptation, 
the generation of the feature vector from which the additive 
noise and multiplicative noise are removed loses the feature 
information of the utterance proper to the speaker, which is to be 
compensated for by the speaker adaptation. As a result, there is 
a problem that an adequate speaker adaptive acoustic model cannot be 
prepared. 

On the other hand, according to the voice recognition 
system of the present invention, the feature vector generation 
section 12 generates the feature vector series [si, M] without 
removing the additive noise and multiplicative noise. The 
feature information of the utterance proper to the speaker, which is to 
be compensated for by the speaker adaptation, is not lost 
because the feature vector series [si, M] is used for the speaker 
adaptation. Therefore, an adequate speaker adaptive acoustic 
model can be prepared to increase the voice recognition rate. 

In this connection, in the present invention, it is 
described that the adaptive voice HMM 400 is prepared on the basis of 
syllables such as Japanese [AIUEO]. However, the invention 
is not limited to the syllable, and the adaptive voice 
HMM 400 based on the phoneme can also be prepared. 

Further, in the present invention, the method of the speaker 
adaptation is described by taking a simple example. However, the 
speaker adaptation method of the present invention can be applied 
to various other speaker adaptation methods in which the standard 
vector [an, M] is coordinated with the feature vector series 
[si, M] or [ci, M] of the speaker adaptation. According thereto, 
the speaker adaptive acoustic model can be generated. 

As described above, according to the voice recognition 
system of the present invention, when the speaker adaptation 
is conducted, the feature vector from which the additive noise 
or the multiplicative noise is removed, and the feature vector 
including the feature of the additive noise or the multiplicative 
noise are generated. According to the feature vector not 
including noise and the feature vector including the noise, 
the standard vector is compensated for. Because the speaker 
adaptive acoustic model adaptive for the utterance proper to 
the speaker is prepared, the speaker adaptive acoustic model 
adaptive for the actual utterance circumstance can be generated. 

Further, because the feature vector is used for the speaker 
adaptation without removing the additive noise or 
multiplicative noise, the feature information of the utterance 
proper to the speaker, which is to be compensated for by the speaker 
adaptation, is not lost. Therefore, an adequate speaker 
adaptive acoustic model can be generated. 

Therefore, a voice recognition system which is robust to 
the noise and whose voice recognition rate is high can be 
obtained. 